METHOD AND APPARATUS FOR NARROW TO VERY WIDE INSTRUCTION
GENERATION FOR ARITHMETIC CIRCUITRY

Cross references to related Applications 

This application is related to the following provisional applications filed with the United States
Patent and Trademark Office:

Serial number 60/204,113, entitled "Method and apparatus of a digital arithmetic and
memory circuit with coupled control system and arrays thereof", filed May 15, 2000 by Jennings,
docket number ARITH001PR;

Serial number 60/215,894, entitled "Method and apparatus of a digital arithmetic and
memory circuit with coupled control system and arrays thereof", filed July 5, 2000 by Jennings,
docket number ARITH002PR;

Serial number 60/217,353, entitled "Method and apparatus of a digital arithmetic and
memory circuit with coupled control system and arrays thereof", filed July 11, 2000 by Jennings,
docket number ARITH003PR;

Serial number 60/231,873, entitled "Method and apparatus of a digital arithmetic and
memory circuit with coupled control system and arrays thereof", filed September 12, 2000 by
Jennings, docket number ARITH004PR;

Serial number 60/261,066, entitled "Method and apparatus of a DSP resource circuit", filed
January 11, 2001 by Jennings, docket number ARITH005PR; and

Serial number 60/282,093, entitled "Method and apparatus of a DSP resource circuit", filed
April 6, 2001 by Jennings, docket number ARITH006PR.

This application claims priority from the following provisional applications filed with the United 
States Patent and Trademark Office: 

Serial number 60/314,411, entitled "Method and apparatus for high speed calculation of
non-linear functions", filed August 22, 2001 by Jennings, docket number ARITH007PR;

Serial number 60/325,093, entitled "A 64 point FFT Engine", filed September 25, 2001 by 
Jennings, docket number ARITH008PR; 

Serial number 60/365,416, entitled "Methods and apparatus compiling non-linear functions,
matrices and instruction memories and the apparatus resulting therefrom", filed March 18, 2002 by
Jennings and Landers, docket number ARITH010PR;

Serial number 60/402,346, entitled "Method and apparatus providing time division
multiplexed arithmetic resources for digital signal processing and emulation of instruction
memories", filed August 9, 2002 by Jennings and Landers, docket number ARITH011PR;

Serial number 60/416,607, entitled "Method and apparatus providing time division
multiplexed arithmetic resources for digital signal processing", filed August 9, 2002 by Jennings and
Landers, docket number ARITH012PR;

Serial number 60/454,755, entitled "Method and apparatus providing configurable generation
of a very long instruction word based upon a narrow instruction, and using a fixed package pinout
to provide a spectrum of arithmetic capability, capacity, performance, programmability and
memory", filed March 14, 2003 by Jennings and Landers, docket number ARITH013PR; and

Serial number 60/470,100, entitled "Method and apparatus implementing and using at least
one logarithmic calculator to optimize floating point performance in a graphics accelerator", filed
May 13, 2003 by Jennings and Landers, docket number ARITH014PR.

This application claims priority as a continuation in part from the following applications filed with
the United States Patent and Trademark Office:

Serial Number 10/276,41, docket number ARITH001US, filed Nov. 12, 2002, which is the
national stage application based upon Serial number PCT/US 01/15,541, entitled "Method and
apparatus of DSP resource allocation and use", filed May 14, 2001 by Jennings, docket number
ARITH001; and

Serial number 10/226,735, entitled "Method and apparatus for high speed calculation of
non-linear functions and networks using non-linear function calculators in digital signal processing",
docket number ARITH003, filed August 22, 2002.

Technical Field 

This invention relates to very wide instructions controlling arithmetic resources. 

Background of Invention:

Today, digital systems in a variety of applications, including both Digital Signal Processing (DSP
hereafter) and graphics accelerators, require the performance of many complex algorithms. These
algorithms often use a wide cross section of specialized non-additive operations and non-linear
functions to achieve their desired results.

These algorithmic requirements place significant strains on how data is processed in these 
application systems. On one hand, the more arithmetic resources processing the data, the greater the 
throughput. On the other hand, the more resources there are to control, the wider the instruction 
controlling these units needs to be, to provide the flexibility to optimally use these resources. 

The wider the instruction word, the greater the system overhead in operating the data processing
resources. The system overhead may include, but is not limited to, the interfacing to external memories,
the external memories, the instruction cache, and the general layout issue of routing many wires
carrying these instruction signals to where they are needed. All of these are significant problems, often
greatly increasing the cost of production and the operational heat generation, and reducing the general
feasibility of such solutions.

Mechanisms and methods are needed to operate multiple data processing resources based upon a narrow 
instruction which can generate a wide instruction where needed. These methods and mechanisms need 
to minimize the routing and other overhead associated with moving wide instructions every cycle. 

Summary of Invention:

The invention includes a method and apparatus for generating a wide instruction controlling at least one
data processing resource, local to that data processing resource, by accessing a local wide instruction
memory based upon a narrow instruction, to generate at least part of the wide instruction. The local
wide instruction memory can be accessed on every instruction cycle to reconfigure the controlled data
processing resource(s).

The data processing resources preferably include arithmetic resources acting on the logarithms of
various operands, which can generate a spectrum of non-additive results as configured by the wide
instructions. These arithmetic resources preferably provide at least some of the following: multiplicative
products of at least two operands, multiplicative products using a power of at least one operand, such
as the square root, the square, 1/the square root, a number raised to an operand, an operand raised to a
specified power, which may be another operand, and the logarithm of an operand.

An application of the invention to a graphics accelerator pipeline is sketched. The application is a shader
calculator, which shows the use of a preferred narrow instruction controlling a data path including 16
programmable arithmetic resources, known herein as logalus, which effect all the operations discussed
above. These logalus may have at least 16 control signals each, collectively requiring at least 256
instruction bits.

A further preferred embodiment permits the narrow instruction to include three fields, a designator field,
a first narrow field and a second narrow field. The designator field is used by the local wide instruction
memories to select which of the first and second narrow fields to use in accessing the memory for
controls of a specific resource.

One preferred use of this embodiment is in a graphics shader with four datapath columns. One
designation may allow three of the four vertical datapaths to perform a 3-vector based operation, while
the fourth vertical datapath may perform a different set of operations, often known as scalar processing.
Another designation may allow all four columns to be used in a 4-vector based operation.

Another preferred use of such embodiments, in a DSP application with four vertical datapath columns,
allows independent use of two columns for complex number arithmetic, such as found in Fast Fourier
Transforms (FFTs), while the remaining two columns may be used for separate purposes, which may
involve other functions.

The invention also includes methods and apparatus for translating a program using these data processing 
resources into the local wide instruction memory contents required to optimally use the data processing 
resources. 

These and many other advantages will become apparent to those skilled in the art upon considering the 
Figures, their description and the claims. 

Brief Description of Drawings: 

Figure 1A shows a narrow instruction accessing a local wide instruction memory to at least partly
create one wide instruction presented to a logalu, configuring the logalu to process at least two
log-operands;

Figure 1B shows a local wide instruction memory providing wide instructions to more than one logalu;

Figure 2 shows more than one local wide instruction memory, each providing wide instructions to more 
than one logalu, the logalus arranged in rows and columns; 

Figure 3 shows one embodiment of the logalu of Figures 1A to 2, receiving four pairs of log-operands,
with a wide instruction of 20 bits providing controls for selecting, shifting, negating, and blocking for
four log-operand inputs to a log adder, which generates the log-result;

Figure 4A shows the local wide instruction memory of Figure 1A, further receiving the narrow 
instruction including a designator field, a first narrow field and a second narrow field; 

Figure 4B shows the local wide instruction memory of Figure 1B, receiving the narrow instruction as
in Figure 4A;

Figure 5A shows one of the local wide instruction memories of Figure 2, providing separate selected 
narrow instructions to the local wide memories associated with the two columns of logalus; 

Figure 5B shows an alternative local wide instruction memory of Figure 2, providing separate selected
narrow instructions to each of the local wide memories associated with the logalus; and

Figure 6 shows a preferred use of the local wide instruction memories of Figure 2 further providing 
wide instructions to additional units. 

Detailed Description of Drawings:

The invention includes a method and apparatus for generating a wide instruction controlling at least one
data processing resource, local to that data processing resource, by accessing a local wide instruction
memory based upon a narrow instruction, to generate at least part of the wide instruction. The local
wide instruction memory can be accessed on every instruction cycle to reconfigure the controlled data
processing resource(s).

The data processing resources preferably include arithmetic resources acting on the logarithms of
various operands, which can generate a spectrum of non-additive results as configured by the wide
instructions. These arithmetic resources preferably provide at least some of the following: multiplicative
products of at least two operands, multiplicative products using a power of at least one operand, such
as the square root, the square, 1/the square root, a number raised to an operand, an operand raised to a
specified power, which may be another operand, and the logarithm of an operand.

Figure 1A shows a narrow instruction 10 provided to a local wide instruction memory 100 to at least
partly create the wide instruction 20 presented to a logalu 200 to configure the logalu 200 to process
at least two, and in this Figure, four pairs of log-operands. The log-operand pairs are the following:
LogA1 202-1, LogA2 202-2; LogB1 204-1, LogB2 204-2; LogC1 206-1, LogC2 206-2; and LogD1 208-1,
LogD2 208-2.

In certain embodiments of the invention, the local wide instruction memory 100 receives a write
instruction 30, as in Figure 1A. Preferably, the response of the local wide instruction memory 100 to
the narrow instruction 10 is altered based upon the write instruction 30.
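
By way of illustration only, the relationship between the narrow instruction 10, the write instruction 30
and the wide instruction 20 can be sketched in software as a small lookup table. The class name, the
method names, the 6-bit narrow width, the 20-bit wide width and the modeling of a write instruction as
an (address, word) pair are all assumptions made for this sketch and are not taken from the Figures.

```python
class LocalWideInstructionMemory:
    """Hypothetical software model of a local wide instruction memory."""

    def __init__(self, narrow_bits: int = 6, wide_bits: int = 20) -> None:
        self.wide_mask = (1 << wide_bits) - 1
        # One wide word per possible narrow instruction value.
        self.table = [0] * (1 << narrow_bits)

    def write(self, address: int, wide_word: int) -> None:
        """Model of a write instruction: change what one narrow value expands to."""
        if not 0 <= address < len(self.table):
            raise ValueError("address outside the local memory")
        self.table[address] = wide_word & self.wide_mask

    def read(self, narrow_instruction: int) -> int:
        """Model of a per-cycle access: expand a narrow instruction to a wide one."""
        if not 0 <= narrow_instruction < len(self.table):
            raise ValueError("narrow instruction outside the local memory")
        return self.table[narrow_instruction]
```

In this model only the narrow value travels to the resource on each cycle, while the wide word stays in
the table next to the resource it configures.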

The logalu 200 of Figure 1A is configured by the wide instruction 20 to operate on the four pairs of
log-operands as shown in Figure 3. The logalu 200 receives four pairs of log-operands 202-1,2 to 208-1,2,
with a wide instruction 20 containing twenty bits 20-1 to 20-20.

Docket No. ARITH004

Wide instruction bits 20-1 to 20-4 control selection within the pairs of log-operands in Figure 3.

• Wide instruction bit 20-1 provides a control for Selmux 210-A to select between LogA1 202-1
and LogA2 202-2 to create LogSelA 212-A which is provided to Shftmux 220-A.

• Wide instruction bit 20-2 provides a control for Selmux 210-B to select between LogB1 204-1
and LogB2 204-2 to create LogSelB 212-B which is provided to Shftmux 220-B.

• Wide instruction bit 20-3 provides a control for Selmux 210-C to select between LogC1 206-1
and LogC2 206-2 to create LogSelC 212-C which is provided to Shftmux 220-C.

• Wide instruction bit 20-4 provides a control for Selmux 210-D to select between LogD1 208-1
and LogD2 208-2 to create LogSelD 212-D which is provided to Shftmux 220-D.

Wide instruction bits 20-5 to 20-12 control log-domain shifting of the selected log-operands in
Figure 3.

• Wide instruction bits 20-5,6 provide controls for Shftmux 220-A shifting LogSelA 212-A to
create a LogSfhtA 222-A, which is provided to Negtvs 230-A.

• Wide instruction bits 20-7,8 provide controls for Shftmux 220-B shifting LogSelB 212-B to
create a LogSfhtB 222-B, which is provided to Negtvs 230-B.

• Wide instruction bits 20-9,10 provide controls for Shftmux 220-C shifting LogSelC 212-C to
create a LogSfhtC 222-C, which is provided to Negtvs 230-C.

• Wide instruction bits 20-11,12 provide controls for Shftmux 220-D shifting LogSelD 212-D to
create a LogSfhtD 222-D, which is provided to Negtvs 230-D.

Wide instruction bits 20-13 to 20-16 control log-domain negation of the shifted, selected log-operands
in Figure 3.

• Wide instruction bit 20-13 provides a control for Negtvs 230-A to possibly negate LogSfhtA
222-A, to create LogNegA 232-A.

• Wide instruction bit 20-14 provides a control for Negtvs 230-B to possibly negate LogSfhtB
222-B, to create LogNegB 232-B.

• Wide instruction bit 20-15 provides a control for Negtvs 230-C to possibly negate LogSfhtC
222-C, to create LogNegC 232-C.

• Wide instruction bit 20-16 provides a control for Negtvs 230-D to possibly negate LogSfhtD
222-D, to create LogNegD 232-D.

Wide instruction bits 20-17 to 20-20 control passing or blocking the possibly negated, shifted, selected
log-operands to create the four processed log-operands 242-A to 242-D presented to the LogAdder4
250, which generates the log domain result 210 in Figure 3.

• Wide instruction bit 20-17 provides a control for PasBlk 240-A to pass or block the LogNegA
232-A to create the processed log-operand A 242-A.

• Wide instruction bit 20-18 provides a control for PasBlk 240-B to pass or block the LogNegB
232-B to create the processed log-operand B 242-B.

• Wide instruction bit 20-19 provides a control for PasBlk 240-C to pass or block the LogNegC
232-C to create the processed log-operand C 242-C.

• Wide instruction bit 20-20 provides a control for PasBlk 240-D to pass or block the LogNegD
232-D to create the processed log-operand D 242-D.

As used herein, a log calculator generates a log-operand by at least performing some version of a
logarithm upon an operand. An exponential calculator generates a result by at least performing some
version of an exponential upon its log-operand input. The logarithm and exponential are preferably
approximate inverses of each other for a wide range of inputs. Further, the logarithm and exponential
are preferably evaluated in base two.

The logalu 200 shown in Figures 1A and 3 effects the multiplicative product of the values represented
by the processed log-operands 242-A to 242-D, which appears as the output result 302 from the
exponential calculator 300 of Figure 1A.
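
The way a sum in the log domain becomes a product at the output can be illustrated with a minimal
sketch. It assumes ideal base-two logarithm and exponential functions in floating point and strictly
positive operands; the treatment of signs and zero is not addressed here. The function names are
hypothetical stand-ins for the log calculator and the exponential calculator.

```python
import math
from typing import Iterable


def log_calculator(operand: float) -> float:
    """Idealized log calculator: base-two logarithm of a positive operand."""
    return math.log2(operand)


def exp_calculator(log_operand: float) -> float:
    """Idealized exponential calculator: two raised to the log-operand."""
    return 2.0 ** log_operand


def product_via_logs(operands: Iterable[float]) -> float:
    """Add the log-operands, then exponentiate, to obtain the product."""
    log_result = sum(log_calculator(x) for x in operands)
    return exp_calculator(log_result)
```

With ideal functions, product_via_logs([3.0, 5.0, 0.5, 4.0]) agrees with 3 × 5 × 0.5 × 4 up to
floating point rounding.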

The log result 210 generated by the logalu 200 of Figures 1A and 3 is provided to an exponential
calculator 300 to generate the non-additive result 302, in Figure 1A. By way of example, assume that
log-operand A1 202-1 is generated by a log calculator 310 as in Figure 6. Assume an operand A is
presented to the log calculator to create log-operand A1 202-1. The contribution of the processed
log-operand A 242-A may have at least some of the following multiplicative effects on the non-additive
result 302:

• an approximation of the operand A,

• an approximation of a square root of the operand A,

• an approximation of a multiplicative inverse of the operand A,

• an approximation of a multiplicative inverse of the square root of the operand A,

• an approximation of a square of the operand A, and

• an approximation of a multiplicative inverse of the square of the operand A.

The approximations preferably satisfy a precision standard.
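
The six multiplicative effects listed above follow from scaling the log-operand by one, one half or two
and optionally negating it before exponentiation. The sketch below tabulates them under the assumption
of ideal base-two functions and a positive operand; the table name EFFECTS and the function
contribution are hypothetical, and a real implementation would only approximate these values to its
precision standard.

```python
import math

# (scale applied to the log-operand, whether it is then negated)
EFFECTS = {
    "A": (1.0, False),
    "sqrt(A)": (0.5, False),
    "1/A": (1.0, True),
    "1/sqrt(A)": (0.5, True),
    "A^2": (2.0, False),
    "1/A^2": (2.0, True),
}


def contribution(operand: float, scale: float, negate: bool) -> float:
    """Multiplicative effect of one processed log-operand on the result."""
    processed = math.log2(operand) * scale  # log-domain shift
    if negate:
        processed = -processed              # log-domain negation
    return 2.0 ** processed                 # exponential calculator
```

Halving the logarithm gives the square root and doubling it gives the square, while negation turns any
of these into its multiplicative inverse.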

Further, the precision standard preferably supports a member of a programming languages collection 
comprising: a version of Java, a version of C, a version of OpenGL, and a version of DirectX. Versions 
of C include, but are not limited to, standard C, Kernighan and Ritchie C, C++, ObjectiveC, Cg, and 
DspC. 

The system overhead for each logalu 200 as shown in Figure 3 is twenty bits of control. When an array
including 16 of these resources, as shown in Figures 2 and 6, is to be used, the price of independent
programming capability for these resources alone is over 300 bits of control (16 × 20 = 320). Routing
these signals long distances within an integrated circuit, much less transferring them to and from an
external memory, or caching them for access on every cycle, would be very expensive.

The inventor realized that in at least graphics accelerator and DSP applications, application programs 
are relatively short, and can only use a relatively small number of distinct configurations of such 
resources. 
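
The trade-off behind this observation can be put in rough numbers. The sketch below uses the 16
logalus and twenty control bits per logalu given in this description, together with the 6 to 8 bit
narrow instruction width the description gives for Figures 2 and 6. It assumes, for simplicity only,
that a single narrow value addresses the controls of all 16 logalus; the function name and the
dictionary keys are hypothetical.

```python
LOGALUS = 16          # array size of Figures 2 and 6
BITS_PER_LOGALU = 20  # control bits per logalu in Figure 3


def control_budget(narrow_bits: int) -> dict:
    """Compare a routed wide instruction with a locally expanded narrow one."""
    wide_bits = LOGALUS * BITS_PER_LOGALU
    configurations = 2 ** narrow_bits
    return {
        "wide_bits": wide_bits,                           # bits needed at the resources
        "configurations": configurations,                 # distinct wide words addressable
        "local_memory_bits": configurations * wide_bits,  # total local storage
        "routed_bits_saved": wide_bits - narrow_bits,     # per-cycle routing avoided
    }
```

Under these assumptions a 6-bit narrow instruction selects among 64 stored configurations of 320 bits
each, so 6 bits rather than 320 are distributed on every cycle.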

Figure 1B shows a local wide instruction memory 100 providing at least partly separate wide
instructions 20-1 to 20-4 associated with several logalus 200-1 to 200-4.

Figures 2 and 6 show application of the invention to a graphics accelerator pipeline or a DSP resource
array. These applications may use a preferred narrow instruction of 6 to 8 bits to control a data path
which may include 16 programmable logalu arithmetic resources. These logalu resources, in
conjunction with exp calculators 300 of Figure 1A and possibly log calculators 310 of Figure 6, effect
at least all the operations discussed above. The logalus 200 as shown in Figure 3 have at least 16
control signals each, collectively requiring at least 256 instruction bits. One preferred use of this
embodiment is in applications with four datapath columns.

A further preferred embodiment permits the narrow instruction 10 to include three fields, a designator
field 12, a first narrow field 14 and a second narrow field 16, as shown in Figures 4A to 5B. The
designator field 12 is used by the local wide instruction memories 100 to select which of the first and
second narrow fields 14 and 16 to use as the selected narrow instruction 112 in accessing the local wide
memory 120 for the controls 20 of a specific resource.
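
A three-field narrow instruction of this kind can be illustrated by packing and unpacking an integer.
The field widths (a 2-bit designator and two 3-bit narrow fields, eight bits in total) and the field
order (designator in the most significant bits) are assumptions chosen for this sketch; the description
does not specify them. The names NarrowInstruction, pack and unpack are hypothetical.

```python
from typing import NamedTuple

DESIGNATOR_BITS = 2  # assumed width of the designator field
FIELD_BITS = 3       # assumed width of each narrow field


class NarrowInstruction(NamedTuple):
    designator: int
    first: int
    second: int


def pack(ni: NarrowInstruction) -> int:
    """Pack the three fields into one narrow instruction word."""
    limits = ((ni.designator, DESIGNATOR_BITS),
              (ni.first, FIELD_BITS),
              (ni.second, FIELD_BITS))
    for value, bits in limits:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit its width")
    return (ni.designator << (2 * FIELD_BITS)) | (ni.first << FIELD_BITS) | ni.second


def unpack(word: int) -> NarrowInstruction:
    """Recover the designator and the two narrow fields from a word."""
    field_mask = (1 << FIELD_BITS) - 1
    return NarrowInstruction(
        designator=(word >> (2 * FIELD_BITS)) & ((1 << DESIGNATOR_BITS) - 1),
        first=(word >> FIELD_BITS) & field_mask,
        second=word & field_mask,
    )
```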

The means for selecting in Figures 4A to 5B may include a selection configuration circuit 110 receiving
the designator field 12, in response to which the circuit 110 selects from the first and second narrow
fields 14 and 16 to at least partly create at least one selected narrow instruction 112.

In certain further preferred embodiments the selection configuration circuit 110 receives a configuration
signal 32 as in Figure 4A. The configuration signal 32 may alter an internal state within the selection
configuration circuit 110, which may further alter the selections based upon the designator field 12.

The application of the designator 12 and two narrow fields 14 and 16 to a graphics accelerator may be
seen in the following example. One designation may allow three of the four vertical datapaths to perform
a 3-vector based operation, while the fourth vertical datapath may perform a different set of operations,
often known as scalar processing. Another designation may allow all four columns to be used in a
4-vector based operation.

Another preferred use of the designator 12 and two narrow fields 14 and 16, in a DSP application with
four vertical datapath columns may allow independent use of two columns for complex number
arithmetic, such as found in Fast Fourier Transforms (FFTs), while the remaining two columns may be
used for separate purposes, which may involve other functions.
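
The graphics and DSP designations just described can be illustrated as a small selection table that,
for each designator value, states which narrow field each of the four columns uses. The particular
designator values, the assignment of columns to fields and the names below are assumptions for this
sketch. Passing a different table to the function stands in for the internal state that a
configuration signal might alter.

```python
from typing import Dict, Tuple

FIRST, SECOND = 0, 1

# Assumed designator values; one entry per column, in column order 1 to 4.
SELECTION_TABLE: Dict[int, Tuple[int, int, int, int]] = {
    0: (FIRST, FIRST, FIRST, SECOND),   # 3-vector operation plus scalar column
    1: (FIRST, FIRST, FIRST, FIRST),    # 4-vector operation
    2: (FIRST, FIRST, SECOND, SECOND),  # two complex-arithmetic columns, two others
}


def select_per_column(designator: int, first: int, second: int,
                      table: Dict[int, Tuple[int, int, int, int]] = SELECTION_TABLE
                      ) -> Tuple[int, ...]:
    """Return the selected narrow instruction presented to each column."""
    fields = (first, second)
    return tuple(fields[choice] for choice in table[designator])
```

With a first narrow field of 5 and a second of 3, designator 0 presents 5 to the three vector columns
and 3 to the scalar column, while designator 1 presents 5 to all four columns.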



Figures 2 and 6 show the invention including more than one local wide instruction memory 100-1 and
100-2, each providing at least partly separate wide instructions to more than one logalu.

The logalus of Figures 2 and 6 are arranged in rows and columns as follows. Column i includes
logalu-i,1, logalu-i,2, logalu-i,3, and logalu-i,4, for i = 1, 2, 3, and 4. Row j includes logalu-1,j,
logalu-2,j, logalu-3,j, and logalu-4,j, for j = 1, 2, 3, and 4.

In certain further preferred embodiments, as shown in Figure 6, additional arithmetic resources may be
provided with wide instructions at least partly generated by the local wide instruction memories.
Examples of these resources include, but are not limited to, log calculators 310, format converters from
floating point to the logarithmic operand notation 320 and from the logarithmic operand notation to
floating point 330.

Figure 4A shows the local wide instruction memory 100 of Figure 1A, further receiving the narrow
instruction 10 including a designator field 12, a first narrow field 14 and a second narrow field 16. Such
embodiments of the invention include a means for selecting the narrow address, controlled at least
partly by the designator 12, from the first and second narrow fields 14 and 16 to create at least one
selected narrow instruction 112. The selected narrow instruction 112 is presented to a local wide
memory 120. The local wide memory 120 responds to the selected narrow instruction 112 to at least
partly generate the wide instruction 20.

Figure 4B shows the local wide instruction memory 100 of Figure 1B, receiving the narrow instruction
10 as in Figure 4A, with multiple local wide memories 120-1 to 120-4, each presented an at least partly
separate selected narrow instruction 112-1 to 112-4. Each of the local wide memories 120-1 to 120-4
responds to its selected narrow instruction 112-1 to 112-4, creating the wide instructions 20-1 to 20-4.

Figure 5A shows one of the local wide instruction memories 100-1 of Figures 2 and 6, providing 
separate selected narrow instructions 112-1 and 112-2 to the local wide memories associated with the 
two columns of logalus. 

Figure 5B shows an alternative local wide instruction memory 100-2 of Figures 2 and 6, providing
separate selected narrow instructions 112-1,1 through 112-2,4 to each of the local wide memories
120-1,1 through 120-2,4 associated with the logalus of Figures 2 and 6.

The preceding embodiments of the invention have been provided by way of example and are not meant 
to constrain the scope of the following claims. 